Final Project Assignment#1: Cam Needels

final_Project_assignment_1
final_project_data_description
Project & Data Description
Author

Cam Needels

Published

April 12, 2023

library(tidyverse)
library(readr)
library(ggplot2)
library(summarytools)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Part 1. Introduction

  1. This data set contains the videos that trended on YouTube in the United States from August 11th, 2020 through April 10th, 2023. It is collected from the top trending video list that YouTube itself provides. The list exists to surface the top-trending videos, the ones that drive the most traffic to the website. Because YouTube has data from all over the world and different videos trend in different countries, this data set covers the US only. Each case represents a video that was trending on a certain date, so the same video can appear multiple times if it trended on multiple dates.

  2. Which topics have the most trending videos? Which variables relate most strongly to a higher view count? Which channels trended the most? Which videos were the most viewed in this time period?

Part 2. Describe the data set(s)

As stated above, this data set covers the trending YouTube videos in the US from August 11th, 2020 until April 10th, 2023. It contains 195,390 rows, each a video on a date it was trending, so a video that trended on several dates appears several times. There are 36,572 unique video titles in this data set. It has 16 columns, including the view count, title, publish date, channel name, category ID, YouTube channel ID, likes, dislikes, comment count, the date the video was trending, the tags on the video, the thumbnail link, whether comments were disabled, whether ratings were disabled, and the description. The category is stored as a number rather than a name, which actually makes it easier to use in my statistical analysis section. This is a large amount of data, and I'm considering removing the duplicates to make it more manageable, but I haven't decided whether that's the route I want to take. I'm mostly interested in comment_count, dislikes, likes, and view_count, as those are the most important variables for YouTube videos.
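One possible way to drop the duplicates mentioned above is to keep, for each video, only the row from the most recent date it was trending. The sketch below runs on a small made-up data frame rather than the real file; `trending_date` is an assumed column name, and identifying a video by `title` follows the `unique(YouTube$title)` count used in this post, although two different videos could share a title.

```r
library(dplyr)

# Made-up stand-in for the real data. `title` and `view_count` appear in
# the post; `trending_date` is an assumed column name.
toy <- data.frame(
  title         = c("Video A", "Video A", "Video B", "Video C", "Video C"),
  trending_date = as.Date(c("2021-01-01", "2021-01-02", "2021-01-01",
                            "2021-03-05", "2021-03-04")),
  view_count    = c(1000, 2500, 400, 900, 700)
)

# Keep one row per title: the latest date on which the video was trending.
# with_ties = FALSE guarantees a single row even if dates repeat.
latest <- toy %>%
  group_by(title) %>%
  slice_max(trending_date, n = 1, with_ties = FALSE) %>%
  ungroup()

latest
```

Keeping the latest row rather than the first is a design choice: the counts on the last trending day are the closest thing in the data to a video's final totals, which matters for a "most viewed" ranking.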

YouTube <- read.csv("B:/Needels/Documents/DACCS 601/DACSS_601_New/posts/CamNeedels_FinalProjectData/US_youtube_trending_data.csv")
YouTube
#the dimensions
dim(YouTube)
[1] 195390     16
#the number of unique video titles
length(unique(YouTube$title))
[1] 36572
descr(YouTube)
Descriptive Statistics  
YouTube  
N: 195390  

                    categoryId   comment_count    dislikes         likes     view_count
----------------- ------------ --------------- ----------- ------------- --------------
             Mean        18.81        10751.38     1560.36     129278.13     2495243.09
          Std.Dev         6.76        81607.96     9403.21     406543.19     7042708.85
              Min         1.00            0.00        0.00          0.00           0.00
               Q1        17.00         1328.00        0.00      18598.00      478516.00
           Median        20.00         2936.00       45.00      42744.50      965991.50
               Q3        24.00         6922.00      864.00     106406.00     2162871.00
              Max        29.00      6738537.00   879354.00   16021534.00   277791741.00
              MAD         5.93         2997.82       66.72      45137.02      898375.54
              IQR         7.00         5594.00      864.00      87807.00     1684349.00
               CV         0.36            7.59        6.03          3.14           2.82
         Skewness        -1.09           40.08       42.97         14.65          14.26
      SE.Skewness         0.01            0.01        0.01          0.01           0.01
         Kurtosis         0.42         2107.45     3035.89        333.20         321.52
          N.Valid    195390.00       195390.00   195390.00     195390.00      195390.00
        Pct.Valid       100.00          100.00      100.00        100.00         100.00

Part 3. The Tentative Plan for Visualization

  1. I plan on using a heat map to show how closely the view count is related to comment_count, dislikes, and likes. I want to see which of those has the greatest correlation. We haven't learned how to make a heat map, but I'll do some research and figure it out because that seems to make the most sense. If I leave the variables as they are, the result is essentially a battleship-style grid in which the color of each cell shows how strongly two variables are related, on a scale from 0% to 100%. For all the other questions I will most likely use bar graphs and limit each one to the top ten trending topics, channels, or videos. For example, to find the most-viewed videos I would put the video on the x-axis and the view count on the y-axis. I will most likely need to parse the current dates with lubridate so that their components can be handled separately, and fortunately for me there are no missing data points. I also plan on using the unique function so I don't get repeats of the same video in the top 10, because it would be redundant to have the same video twice in that analysis.
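The heat map described above could be sketched roughly as follows. This is only an illustration on simulated numbers, not the real data: the four column names come from this post, while the simulated values and the choice of Spearman correlation (picked here because the descriptive statistics show heavily skewed counts) are assumptions. Note that a correlation coefficient runs from -1 to 1 rather than 0% to 100%, so the color scale is set up for that range.

```r
library(ggplot2)

# Simulated stand-in for the four engagement columns named in the post.
set.seed(1)
views <- rlnorm(200, meanlog = 13, sdlog = 1)
toy <- data.frame(
  view_count    = views,
  likes         = views * runif(200, 0.02, 0.08),
  dislikes      = views * runif(200, 0.0005, 0.002),
  comment_count = views * runif(200, 0.001, 0.01)
)

# Pairwise rank correlations between every pair of columns.
corr <- cor(toy, method = "spearman")

# Reshape the 4 x 4 matrix into long form: one row per pair of variables.
corr_long <- as.data.frame(as.table(corr))
names(corr_long) <- c("var1", "var2", "r")

# One tile per pair, colored by the correlation and labeled with its value.
heat <- ggplot(corr_long, aes(x = var1, y = var2, fill = r)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(r, 2))) +
  scale_fill_gradient2(limits = c(-1, 1),
                       low = "blue", mid = "white", high = "red") +
  labs(x = NULL, y = NULL, fill = "Spearman r")

heat
```

With the real data, `toy` would be replaced by the four numeric columns of `YouTube`, for example `YouTube[, c("view_count", "likes", "dislikes", "comment_count")]`.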